Classifying spam emails using agglomerative hierarchical clustering and a topic-based approach

Abstract

Spam emails are unsolicited, annoying and sometimes harmful messages which may contain malware, phishing or hoaxes. Unlike most studies, which address the design of efficient anti-spam filters, we approach the spam email problem from a different, novel perspective. Focusing on the needs of cybersecurity units, we follow a topic-based approach for addressing the classification of spam emails into multiple categories. We propose SPEMC-15K-E and SPEMC-15K-S, two datasets with approximately 15K emails each in English and Spanish, respectively, and label them using agglomerative hierarchical clustering into 11 classes. We evaluate 16 pipelines, combining four text representation techniques (Term Frequency-Inverse Document Frequency (TF-IDF), Bag of Words, Word2Vec and BERT) and four classifiers: Support Vector Machine (SVM), Naïve Bayes (NB), Random Forest (RF) and Logistic Regression (LR). Experimental results show that the highest performance on the English dataset is achieved with TF-IDF and LR, with an F1 score of 0.953 and an accuracy of 94.6%, while on the Spanish dataset TF-IDF with NB yields an F1 score of 0.945 and 98.5% accuracy. Regarding processing time, TF-IDF with LR leads to the fastest classification, processing an English and a Spanish spam email in 2 ms and 2.2 ms on average, respectively.
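The labelling step described in the abstract (text representation followed by agglomerative hierarchical clustering) can be sketched in a few lines of standard-library Python. This is only an illustrative sketch, not the authors' implementation: the whitespace tokenisation, the smoothed IDF formula, the cosine distance, the average linkage, the function names (`tfidf_vectors`, `cosine_distance`, `agglomerative`) and the four toy emails are all assumptions, since the abstract specifies none of them.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalised TF-IDF vector (term -> weight dict) per document."""
    tokenised = [doc.lower().split() for doc in docs]  # assumed: whitespace tokens
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Assumed smoothed IDF: ln((1 + N) / (1 + df)) + 1.
        vec = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine_distance(a, b):
    # Both vectors have unit length, so their dot product is the cosine similarity.
    return 1.0 - sum(w * b.get(t, 0.0) for t, w in a.items())


def agglomerative(vectors, n_clusters):
    """Bottom-up clustering with average linkage; returns one label per vector."""
    clusters = [[i] for i in range(len(vectors))]  # start with singleton clusters

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge the second into the first.
        _, a, b = min(
            (linkage(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        clusters[a] += clusters[b]
        del clusters[b]  # b > a, so index a is unaffected

    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy data (hypothetical): two pharmacy-themed and two phishing-themed messages.
emails = [
    "cheap pills online pharmacy discount",
    "discount pharmacy pills cheap offer",
    "verify your bank account password now",
    "your bank account password expired verify",
]
labels = agglomerative(tfidf_vectors(emails), n_clusters=2)
print(labels)  # [0, 0, 1, 1]
```

The naive merge loop recomputes every pairwise linkage at each step, which is fine for a handful of messages but would not scale to datasets of roughly 15K emails; a library implementation would be the practical choice there.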


Similar resources

Implementing Agglomerative hierarchical clustering using multiple attribute

The agglomerative hierarchical clustering algorithm is used with a top-down approach. It is implemented with multiple attributes, for which a frequency calculation is allocated. Memory requirements are lower in this process. Hierarchical clustering produces more accurate results than other algorithms, and the process is much less time-consuming.

Competence maps using agglomerative hierarchical clustering

Knowledge management from a strategic planning point of view often requires having an accurate understanding of a firm’s or a nation’s competences in a given technological discipline. Knowledge maps have been used for the purpose of discovering the location, ownership and value of intellectual assets. The purpose of this article is to develop a new method for assessing national and firm-level co...

Modern hierarchical, agglomerative clustering algorithms

This paper presents algorithms for hierarchical, agglomerative clustering which perform most efficiently in the general-purpose setup that is given in modern standard software. Requirements are: (1) the input data is given by pairwise dissimilarities between data points, but extensions to vector data are also discussed (2) the output is a “stepwise dendrogram”, a data structure which is shared ...

Divisive Hierarchical Clustering with K-means and Agglomerative Hierarchical Clustering

The aim is to implement a divisive hierarchical clustering algorithm with K-means and to apply agglomerative hierarchical clustering to the resulting data, for data mining tasks that call for efficient and accurate results. The hierarchical clustering finds the initial k centroids in a fixed manner instead of choosing them randomly. The k centroids are chosen by dividing the one dimensional data of a particular clu...

Clustering Acoustic Segments Using Multi-Stage Agglomerative Hierarchical Clustering

Agglomerative hierarchical clustering becomes infeasible when applied to large datasets due to its O(N^2) storage requirements. We present a multi-stage agglomerative hierarchical clustering (MAHC) approach aimed at large datasets of speech segments. The algorithm is based on an iterative divide-and-conquer strategy. The data is first split into independent subsets, each of which is clustered se...

Journal

Journal title: Applied Soft Computing

Year: 2023

ISSN: 1568-4946, 1872-9681

DOI: https://doi.org/10.1016/j.asoc.2023.110226